The data set used in this project is the Kaggle ML and Data Science Survey 2017. The survey responses were stored in two data sets: a) multiple-choice items and b) free-response items. Kaggle stored each in CSV format. We downloaded the multiple-choice survey results in CSV format and placed them in our GitHub repo.
Importing Multiple Choice data
# load the packages used throughout this report
library(tidyverse)  # read_csv, dplyr, tidyr, ggplot2
library(DT)         # datatable
library(plotly)     # ggplotly

linkMC<-"https://raw.githubusercontent.com/betsyrosalen/DATA_607_Project_3/master/project3_master/rawdata/multipleChoiceResponses.csv"
#importing MC items
MC<-read_csv(linkMC)
dim(MC)
## [1] 16716   228
# let's create a unique ID variable
MC$id <- seq.int(nrow(MC))
The following commented-out code imports the conversion-rates data in case we want to do currency analyses later; ignore it for now.
# link_conversion<-"https://raw.githubusercontent.com/betsyrosalen/DATA_607_Project_3/master/project3_master/rawdata/conversionRates.csv"
# #importing MC items
# conversion<-read_csv (link_conversion)
# dim(conversion)
# #lets create a unique ID variable
# conversion$id <- seq.int(nrow(conversion))
This project will answer the following global research question: Which are the most valued data science skills? The following six research questions will provide an answer to this global question.
What is the relationship between the most popular platforms for learning DS and X (Niteen)? Alternatively phrased: What data science learning resources and which open-data sources are used by people of varying levels of education?
This section will describe the names of the variables and their labels (as reported in the schema doc) and how the values were coded (e.g., yes, no, select all).
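As a sketch of how those variable labels could be pulled programmatically, assuming the Kaggle schema file (schema.csv, with Column and Question fields, as in the Kaggle 2017 release) also sits in the repo's rawdata folder — the exact path and field names here are assumptions, not confirmed by this report:
# Hypothetical: look up the full question text for a variable in the schema doc.
# Assumes schema.csv sits next to multipleChoiceResponses.csv and has
# "Column" and "Question" fields.
link_schema <- "https://raw.githubusercontent.com/betsyrosalen/DATA_607_Project_3/master/project3_master/rawdata/schema.csv"
schema <- read_csv(link_schema)
schema %>%
  filter(Column == "LearningPlatformSelect") %>%
  select(Column, Question)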
Does survey takers’ formal education have any relationship to the ML/DS method he or she is most excited about learning in the next year? (Binish)
This section will describe the names of the variables and their labels (as reported in the schema doc) and how the values were coded (e.g., yes, no, select all).
What are the most frequently used DS methods? Where is the most time spent in terms of working with data? Do either of these correlate with job title or level of education? (Zach)
This section will describe the names of the variables and their labels (as reported in the schema doc) and how the values were coded (e.g., yes, no, select all).
Is there a difference between what ‘Learners’ think are the important skills to learn and what employed Data Scientists say are the skills and tools they are using? (Betsy)
This section will describe the names of the variables and their labels (as reported in the schema doc) and how the values were coded (e.g., yes, no, select all).
Is there any interaction between the Kaggle survey takers’ programming language use (R or Python) and the languages they recommend? (e.g., R users recommending R more often than Python users recommend Python) (Burcu)
This section will describe the names of the variables and their labels (as reported in the schema doc) and how the values were coded (e.g., yes, no, select all).
dim(MC)
## [1] 16716   229
tb1<-MC %>%
select (id, WorkToolsSelect) %>%
filter (id %in% c(1:6))
datatable(tb1)
# removing NAs and empty values in column WorkToolsSelect
df <- MC[!(MC$WorkToolsSelect == "" | is.na(MC$WorkToolsSelect)), ]
dim(df)
## [1] 7955  229
tb2<-df %>%
select (id, WorkToolsSelect) %>%
filter (id %in% c(1:6))
datatable(tb2)
# creating a new variable called work_tools where the original column values are split
#please note that this code will generate long data
df1<-df %>%
mutate(work_tools = strsplit(as.character(WorkToolsSelect), ",")) %>%
unnest(work_tools)
#check
tb3<-df1 %>%
select (id, WorkToolsSelect,work_tools) %>%
filter (id %in% c(1:3))
datatable(tb3)

df2<-df1 %>%
group_by(id, work_tools) %>%
summarize (total_count = n()) %>%
spread( work_tools, total_count, fill=0)
df3<-df2 %>%
mutate(lang_use = case_when (
(R==1 & Python==0) ~ "Using R Only",
(R==0 & Python==1) ~ "Using Python only",
(R==1 & Python==1) ~ "Using Both Python and R",
(R==0 & Python==0) ~ "Using Neither Python nor R"))%>%
select (id, R, Python, lang_use)
tb4<-df3 %>%
filter (id %in% c(1:10))
datatable(tb4)
# computing percentages
df4<-df3 %>%
group_by(lang_use) %>%
summarize (total_count = n()) %>%
mutate(percent = ((total_count / sum(total_count)) * 100), percent=round(percent, digits=2))
#checking
datatable(df4, colnames=c("Programming Language Survey takers use", "Count", "Percent"), class = 'cell-border stripe', caption = 'Table 1: Descriptive Statistics', options = list(pageLength = 2, dom = 'tip'))

p<-ggplot (df4, aes(x=lang_use,y=percent,fill=lang_use )) +
geom_bar(stat="identity", width =.5) +
labs (x="Language ", y="The distribution of R and Python among their users (%) " ,
title="Bar Graph of R and Python users") +
theme(axis.text.x = element_text(angle = 90)) +
scale_y_continuous (breaks=seq(0,100,10), limits = c(0,100))
ggplotly(p)
Let’s examine the above graph by LanguageRecommendationSelect.
#check
tb5<-df1 %>%
select (id, WorkToolsSelect,work_tools, LanguageRecommendationSelect) %>%
filter (id %in% c(1:3))
datatable(tb5)

df5<-df1 %>%
group_by(id, work_tools,LanguageRecommendationSelect) %>%
summarize (total_count = n()) %>%
spread( work_tools, total_count, fill=0) %>%
mutate(lang_use = case_when (
(R==1 & Python==0) ~ "Using R Only",
(R==0 & Python==1) ~ "Using Python only",
(R==1 & Python==1) ~ "Using Both Python and R",
(R==0 & Python==0) ~ "Using Neither Python nor R"),
lang_rec = case_when (
(LanguageRecommendationSelect=="R") ~ "Recommending R",
(LanguageRecommendationSelect=="Python") ~ "Recommending Python",
(is.na(LanguageRecommendationSelect) | LanguageRecommendationSelect=="") ~ "Recommending Nothing",
TRUE ~ "Recommending Neither Python nor R"))%>%
select (id, R, Python, lang_use,lang_rec )
dim(df5)
## [1] 7955    5
tb6<-df5 %>%
filter (id %in% c(1:10))
datatable(tb6)
# computing percentages
df6<-df5 %>%
group_by(lang_use,lang_rec) %>%
summarize (total_count = n()) %>%
mutate(percent = ((total_count / sum(total_count)) * 100), percent=round(percent, digits=2))
#checking
datatable(df6, colnames=c("Programming Language Survey takers use", "Count", "Percent"), class = 'cell-border stripe', caption = 'Table 2: Descriptive Statistics', options = list(pageLength = 2, dom = 'tip'))

p1<-ggplot (df6, aes(x=lang_use,y=percent,fill=lang_use )) +
geom_bar(stat="identity", width =.5) +
labs (x="Language ", y="The distribution of R and Python among their users (%) " ,
title="Bar Graph of R and Python users and their recommended language") +
theme(axis.text.x = element_text(angle = 90)) +
scale_y_continuous (breaks=seq(0,100,10), limits = c(0,100))+
facet_wrap(~lang_rec)+
theme(legend.position = 'none')
ggplotly(p1)
Of those receiving pay in US Dollars, is Python or R overall most profitable for a Kaggle survey taker? (Gabby)
This section will describe the names of the variables and their labels (as reported in the schema doc) and how the values were coded (e.g., yes, no, select all).
RQ6 <- MC %>%
mutate(work_tools = strsplit(as.character(WorkToolsSelect), ",")) %>%
unnest(work_tools)
RQ6 <- RQ6 %>%
filter(!is.na(WorkToolsSelect)) %>% # Filter out all rows with NA in the WorkToolsSelect column
filter(CompensationCurrency == "USD") %>% # Keep only rows whose compensation is paid in USD
filter(work_tools == "Python" | work_tools == "R") %>% # Keep only the R and Python work tools
select(id, work_tools, CompensationAmount) # Keep only the three columns we need
RQ6_ids <- select(filter(as.data.frame(table(RQ6$id)), Freq == 1), Var1) # Only want people who use R or Python EXCLUSIVELY, not R and/or Python
RQ6_ids <- droplevels(RQ6_ids)$Var1 # Removed the levels so we can actually get the IDs
RQ6 <- filter(RQ6, id %in% RQ6_ids) # Only keep those rows whose id are inside of list of ids with R or Python exclusively used at work
RQ6 <- select(RQ6, -id) # No use for the ID anymore, it's done its job
RQ6$CompensationAmount <- gsub(",", "", RQ6$CompensationAmount) # Removed the commas from the compensation amount to prep for numeric transformation
RQ6$CompensationAmount <- as.numeric(RQ6$CompensationAmount) # made the column into a numeric for easier mathematical comparison and sorting
RQ6 <- filter(RQ6, CompensationAmount < 9999999) # drop an implausible outlier: a single entry one dollar short of ten million, far beyond any realistic salary in this data set
rm(RQ6_ids) # remove the now-unused variable to save memory

RQ6_boxplot <- ggplot(RQ6) +
geom_boxplot( aes(x = factor(work_tools),
y = CompensationAmount,
fill = factor(work_tools)
)
) +
scale_y_continuous(breaks=seq(0,2000000,25000)) +
labs( x = "Programming Language",
y = "Annual Compensation in USD",
fill = "Programming Language")
RQ6_boxplot_ylim <- boxplot.stats(RQ6$CompensationAmount)$stats[c(1, 5)]
RQ6_boxplot <- RQ6_boxplot + coord_cartesian(ylim = RQ6_boxplot_ylim*1.05)
RQ6_boxplot
The average survey taker who used Python in their job made approximately $14,648.50 more than the average survey taker who used R. While R users overall had a higher base pay, to the tune of $5,000.00 more than their Python counterparts, their salary growth was noticeably more limited. Outliers aside, if the data collected is representative of the data science population, there is some indication that a prospective Data Scientist should learn R first for a higher initial salary, and then learn Python to increase their chances of obtaining a job with more growth potential.
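The mean gap quoted above can be recomputed directly from the RQ6 data frame built earlier; this is just a sanity check on the cited figure, not additional analysis:
# Group mean and median compensation by language; the difference between the
# two group means should be roughly the $14,648.50 gap cited above.
RQ6 %>%
  group_by(work_tools) %>%
  summarize(mean_pay = mean(CompensationAmount),
            median_pay = median(CompensationAmount),
            n = n())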